Abstract —This paper explores and analyzes two randomized designs for robust Principal Component Analysis (PCA) employing low-dimensional data sketching. In one design, a data sketch is constructed using random column sampling followed by low-dimensional embedding, while in the other, sketching is based on random column and row sampling. Both designs are shown to bring about substantial savings in complexity and memory requirements for robust subspace learning over conventional approaches that use the full scale data. A characterization of the sample and computational complexity of both designs is derived in the context of two distinct outlier models, namely, sparse and independent outlier models. The proposed randomized approach can provably recover the correct subspace with computational and sample complexity that are almost independent of the size of the data. The results of the mathematical analysis are confirmed through numerical simulations using both synthetic and real data.

Index Terms —Low Rank Matrix, Robust PCA, Randomized 
Algorithm, Subspace Learning, Big Data, Outlier Detection, 
Sketching, Column/Row Sampling, Random Embedding 

I. Introduction 

Principal Component Analysis (PCA) has been routinely 
used to reduce dimensionality by finding linear projections 
of high-dimensional data into lower dimensional subspaces. 
Such linear models are highly pertinent to a broad range 
of data analysis problems, including computer vision, image 
processing, machine learning and bioinformatics.

Given a data matrix $D \in \mathbb{R}^{N_1 \times N_2}$, PCA finds an $r$-dimensional subspace by solving

$$\min_{U} \|D - U U^T D\|_F \quad \text{subject to} \quad U^T U = I, \tag{1}$$

where $U \in \mathbb{R}^{N_1 \times r}$ is an orthonormal basis for the $r$-dimensional subspace, $I$ denotes the identity matrix and $\|\cdot\|_F$ the Frobenius norm. While PCA is useful when the data has low intrinsic dimension, it is notoriously sensitive to outliers in the sense that the solution to (1) can arbitrarily deviate from the true underlying subspace if a small portion of the data is not contained in this low-dimensional subspace.
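As a minimal illustration of problem (1) and of its sensitivity to outliers, the sketch below solves (1) through a truncated SVD and compares the recovered subspace with and without a few large outlying columns. The sizes, the random seed and the outlier magnitude are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def pca_basis(D, r):
    """Orthonormal basis U (N1 x r) minimizing ||D - U U^T D||_F."""
    # The minimizer of (1) is spanned by the top-r left singular vectors of D.
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    return U[:, :r]

def subspace_error(U_true, U_est):
    """Norm of the part of U_true that lies outside span(U_est)."""
    return np.linalg.norm(U_true - U_est @ (U_est.T @ U_true))

rng = np.random.default_rng(0)
N1, N2, r = 50, 200, 3                      # assumed sizes
U_true = np.linalg.qr(rng.standard_normal((N1, r)))[0]
L = U_true @ rng.standard_normal((r, N2))   # clean rank-r data

U_clean = pca_basis(L, r)
err_clean = subspace_error(U_true, U_clean)

D = L.copy()
D[:, :5] = 100.0 * rng.standard_normal((N1, 5))  # five large outlying columns
U_bad = pca_basis(D, r)
err_bad = subspace_error(U_true, U_bad)
```

On clean data the error is at the level of floating-point noise, whereas the five outlying columns dominate the leading singular vectors and pull the estimate away from the true subspace.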

As outliers are prevalent in much real-world data, a large body of research has focused on developing robust PCA algorithms that are not unduly affected by the presence of outliers. The corrupted data can be expressed as

$$D = L + C, \tag{2}$$

where $L$ is a low rank matrix whose columns lie in a low-dimensional subspace, and the matrix $C$, called the outlier matrix, models the data corruption. Two main models for data corruption, which are for the most part incomparable, have been considered in the literature, namely, element-wise and column-wise corruption. In the former model, $C$ is an element-wise sparse matrix with arbitrary support, whose entries can have arbitrarily large magnitudes. In this model, all the columns of $L$ may be affected by the non-zero elements of $C$ given its arbitrary support pattern. In the column-wise model, a portion of the columns of $C$ are non-zero and these non-zero columns do not lie in the column space of $L$. Thus, a portion of the columns of $L$, the so-called inliers, are unaffected by $C$. This paper focuses on the column-wise outlier model according to the following data model.

Data Model 1. The given data matrix $D$ satisfies the following conditions.

1. The matrix $D$ can be expressed as (2).

2. $\mathrm{rank}(L) = r$.

3. The matrix $C$ has $K$ non-zero columns. The non-zero columns of $C$ do not lie in the column space of $L$. Hence, if $\mathcal{I}$ is the index set of the non-zero columns of $C$ and $U \in \mathbb{R}^{N_1 \times r}$ is an orthonormal basis for the column space of $L$, then

$$(I - U U^T) C_i \neq 0 \quad \text{for } i \in \mathcal{I}, \tag{3}$$

where $C_i$ is the $i$-th column of $C$.

4. Without loss of generality, it is assumed that $L_i = 0$ for $i \in \mathcal{I}$, where $L_i$ is the $i$-th column of $L$. Define $L' \in \mathbb{R}^{N_1 \times N_2'}$ as the matrix of non-zero columns of $L$ (the inlier columns) and $N_2'$ as the number of inlier columns, i.e., $N_2 = K + N_2'$.
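The conditions of Data Model 1 can be made concrete with a small synthetic generator. The following is a sketch under stated assumptions: the function name `make_data`, the Gaussian distribution of inliers and outliers, and all sizes are hypothetical choices for illustration only.

```python
import numpy as np

def make_data(N1, n_inliers, K, r, rng):
    """Draw D = L + C with K outlying columns at random positions."""
    N2 = n_inliers + K
    U = np.linalg.qr(rng.standard_normal((N1, r)))[0]    # basis for col. space of L
    outlier_idx = np.sort(rng.choice(N2, size=K, replace=False))  # index set I
    L = U @ rng.standard_normal((r, N2))
    L[:, outlier_idx] = 0.0                               # L_i = 0 for i in I
    C = np.zeros((N1, N2))
    C[:, outlier_idx] = rng.standard_normal((N1, K))      # Gaussian outliers (assumption)
    return L + C, L, C, U, outlier_idx

rng = np.random.default_rng(1)
D, L, C, U, idx = make_data(N1=40, n_inliers=100, K=20, r=4, rng=rng)

# Condition (3): every outlier keeps a non-zero component outside span(U).
resid = np.linalg.norm(C[:, idx] - U @ (U.T @ C[:, idx]), axis=0)
```

Generic Gaussian columns in a 40-dimensional space almost surely have a component outside a 4-dimensional subspace, so condition (3) holds for this draw.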

The problem of robust PCA has received considerable attention in recent years. However, the state-of-the-art robust estimators and matrix decomposition techniques are mostly unscalable, which limits their usefulness in big data applications. For instance, many of the existing approaches rely on iterative algorithms that involve computing a Singular Value Decomposition (SVD) of the $N_1 \times N_2$ data matrix in each iteration, which is computationally prohibitive in high-dimensional settings. This motivates the work of this paper.

A. Notation and definitions 

Given a matrix $L$, $\|L\|$ denotes its spectral norm, $\|L\|_*$ its nuclear norm, which is the sum of the singular values, and $\|L\|_1$ its $\ell_1$-norm given by $\|L\|_1 = \sum_{i,j} |L(i,j)|$, i.e., the sum of the absolute values of its entries. The norm $\|L\|_{1,2}$ is defined as $\|L\|_{1,2} = \sum_i \|L_i\|_2$, where $\|L_i\|_2$ is the $\ell_2$-norm of the $i$-th column of $L$. In an $N$-dimensional space, $e_i$ is the $i$-th vector of the standard basis. For a given vector $a$, $\|a\|_p$ denotes its $\ell_p$-norm. Two linear subspaces $S_1$ and $S_2$ are said to be independent if the dimension of their intersection $S_1 \cap S_2$ is equal to zero. In the presented algorithms and analysis, we make use of the following definitions.
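For reference, the four matrix norms used in the notation can be computed with NumPy as sketched below. The helper names are hypothetical; note that the $\ell_1$-norm here is the entrywise sum, not the induced matrix 1-norm.

```python
import numpy as np

def spectral_norm(L):
    return np.linalg.norm(L, 2)             # largest singular value

def nuclear_norm(L):
    return np.linalg.norm(L, 'nuc')         # sum of singular values

def l1_norm(L):
    return np.abs(L).sum()                  # entrywise sum of absolute values

def l12_norm(L):
    return np.linalg.norm(L, axis=0).sum()  # sum of column l2-norms

# A single non-zero column (3, 4): one singular value equal to 5.
A = np.array([[3.0, 0.0],
              [4.0, 0.0]])
```

For this `A`, the spectral, nuclear and $\ell_{1,2}$ norms all equal 5, while the entrywise $\ell_1$-norm equals 7.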


Definition 1. The row space of a matrix $L$ with rank $r$ and $N_2$ non-zero columns is said to be incoherent with parameters $\mu_v$, $\eta_v$ if

$$\max_i \|V^T e_i\|_2^2 \le \frac{r \mu_v}{N_2}, \qquad \eta_v = \sqrt{N_2}\, \max_{i,j} |V(i,j)|, \tag{4}$$

where $V$ is an orthonormal basis for the row space of $L$. Similarly, the column space of $L$ is said to be incoherent with parameters $\mu_u$ and $\eta_u$ if

$$\max_i \|U^T e_i\|_2^2 \le \frac{r \mu_u}{N_1} \quad \text{and} \quad \eta_u = \sqrt{N_1}\, \max_{i,j} |U(i,j)|. \tag{5}$$
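The row-space incoherence parameters can be evaluated numerically. The sketch below assumes the standard form of the definition, namely $\max_i \|V^T e_i\|_2^2 \le r\mu_v/N_2$ and $\eta_v = \sqrt{N_2}\max_{i,j}|V(i,j)|$, and returns the smallest $\mu_v$ satisfying the inequality; the function name is hypothetical.

```python
import numpy as np

def row_space_incoherence(L, r):
    """Return (mu_v, eta_v) for a rank-r matrix L with N2 columns."""
    N2 = L.shape[1]
    _, _, Vt = np.linalg.svd(L, full_matrices=False)
    V = Vt[:r].T                                     # N2 x r basis of the row space
    # Smallest mu_v with max_i ||V^T e_i||^2 <= r * mu_v / N2
    mu_v = (N2 / r) * np.max(np.sum(V**2, axis=1))
    eta_v = np.sqrt(N2) * np.max(np.abs(V))
    return mu_v, eta_v

flat = np.ones((4, 6))              # energy spread evenly over all columns
spiky = np.zeros((4, 6))
spiky[:, 0] = 1.0                   # all energy in a single column

mu_flat, eta_flat = row_space_incoherence(flat, r=1)
mu_spiky, eta_spiky = row_space_incoherence(spiky, r=1)
```

The evenly spread matrix attains the minimum value $\mu_v = 1$, whereas the matrix concentrated on one column attains the maximum $\mu_v = N_2 = 6$, which is the regime in which random column sampling is least informative.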


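Definition 2. (Distributional Johnson-Lindenstrauss (JL) Property) An $m \times n$ matrix $\Phi$ is said to satisfy the Distributional JL property if for any fixed $v \in \mathbb{R}^n$ and any $\epsilon \in (0,1)$,

$$\mathbb{P}\left( \left| \|\Phi v\|_2^2 - \|v\|_2^2 \right| \ge \epsilon \|v\|_2^2 \right) \le 2 e^{-m f(\epsilon)}, \tag{6}$$

where $f(\epsilon) > 0$ is a constant that is specific to the distribution of $\Phi$ and depends only on $\epsilon$.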
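The Distributional JL property can be checked empirically for one common choice of $\Phi$. The sketch below assumes a Gaussian matrix with i.i.d. $\mathcal{N}(0, 1/m)$ entries, which is only one of the distributions covered by the definition, and estimates how often the squared norm of a fixed vector is distorted by more than a factor $\epsilon$. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, eps, trials = 200, 100, 0.5, 500     # assumed sizes
v = rng.standard_normal(n)                 # one fixed vector
v_sq = np.linalg.norm(v) ** 2

fails = 0
for _ in range(trials):
    # Scaling by 1/sqrt(m) gives E||Phi v||^2 = ||v||^2
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)
    distortion = abs(np.linalg.norm(Phi @ v) ** 2 - v_sq)
    fails += int(distortion >= eps * v_sq)

fail_rate = fails / trials
```

With $m = 100$ the empirical failure rate is close to zero, in line with the exponential decay in $m$ stated in the definition.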


We refer the reader to [22] for further details concerning the properties of the incoherency parameters. Also, similar to Definition 1, we define $\mu_v'$ as the row space incoherency of $L'$, i.e., if $V' \in \mathbb{R}^{N_2' \times r}$ is an orthonormal basis for the row space of $L'$, then $\max_i \|e_i^T V'\|_2^2 \le r \mu_v' / N_2'$.


B. Summary of contributions 

Motivated by the aforementioned limitation of existing approaches in big data settings, which is further elaborated in the related work section, this paper explores and analyzes a randomized approach to robust PCA using low-dimensional data sketching. Two randomized designs are considered. The first design is the Random Embedding Design (RED), wherein a random subset of the data columns is selected and then embedded into a random low-dimensional subspace. The second randomized design is a Random Row-sampling Design (RRD), in which a random subset of the data columns is sampled, followed by a random subset of the rows of the sampled columns. Unlike conventional robust PCA algorithms that use the full-scale data, robust subspace recovery is applied to the reduced data sketch.
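The two sketching designs described above can be outlined in a few lines. This is a sketch under assumptions: the helper names are hypothetical, columns are drawn without replacement to form a random subset, and a Gaussian matrix is used as one possible random embedding for RED.

```python
import numpy as np

def sample_columns(D, m1, rng):
    """Select a random subset of m1 columns of D."""
    idx = rng.choice(D.shape[1], size=m1, replace=False)
    return D[:, idx], idx

def sketch_red(Ds, m2, rng):
    """RED: embed the sampled columns into a random m2-dimensional subspace."""
    Phi = rng.standard_normal((m2, Ds.shape[0])) / np.sqrt(m2)  # Gaussian embedding (assumption)
    return Phi @ Ds

def sketch_rrd(Ds, m2, rng):
    """RRD: keep a random subset of m2 rows of the sampled columns."""
    rows = rng.choice(Ds.shape[0], size=m2, replace=False)
    return Ds[rows, :]

rng = np.random.default_rng(0)
N1, N2, r = 100, 400, 3
L = rng.standard_normal((N1, r)) @ rng.standard_normal((r, N2))  # rank-3 data

Ds, idx = sample_columns(L, m1=40, rng=rng)
red = sketch_red(Ds, m2=10, rng=rng)
rrd = sketch_rrd(Ds, m2=10, rng=rng)
```

Both sketches are only $10 \times 40$ yet retain the rank of the original $100 \times 400$ matrix in this example. RED costs a matrix product, whereas RRD only indexes rows.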

TABLE I
Order of sufficient number of random linear data observations.

Outlier Model/Design    RED                   RRD
Column-sparsity         max(/i„, K/N 2 )      max(^i,, K/N 2 )
Independence            max(l, fy K/N 2 )     3nax{nl,fy K/N 2 )

We consider two distinct popular models for the outlier matrix. In the first model, the independent outlier model, it is assumed that any small subset of the non-zero columns of $C$ is not linearly dependent. This model allows for a remarkable portion of the data to be outliers. In the second model, the sparse outlier model, it is assumed that $C$ is column-sparse, i.e., a very small portion of the given data columns are outliers, but no assumption is made about the linear dependence of the outlying columns. For both outlier models, we prove that the randomized approach using either of the designs can recover the correct subspace with high probability (whp). Some of the key technical contributions of this paper are listed below.

1. To the best of our knowledge, RRD is used and analyzed here for the first time for robust PCA with column-wise corruption. We prove that RRD can recover the correct subspace
using roughly 0(r'^pyjiy) random linear data observations. 
The complexity of subspace recovery in RRD is roughly 

D(t PyPyf 

2. For RED, it is shown here for the first time that the sufficient number of random linear data observations for correct subspace recovery is roughly $O(r^2 \mu_v)$.

3. The proposed randomized approach based on the linear independence of the outlier columns is novel. We take advantage of random column sampling to substantially reduce the number of outlying columns. Thus, unlike conventional approaches that need to go through all the columns to identify the outliers, we only need to check $O(r \mu_v)$ data points.

Table I summarizes the derived order of sufficient number of linear random data observations for the randomized designs with both outlier models.

II. Related Work 

A. Robust PCA 

Some of the earliest approaches to robust PCA relied 
on robust estimation of the data covariance matrix, such 
as S-estimators, the minimum covariance determinant, the 
minimum volume ellipsoid, and the Stahel-Donoho estimator. However, these approaches are not applicable in high-dimensional settings due to their computational complexity and memory requirements, and we are not aware of any scalable algorithms for implementing these methods with explicit performance guarantees.

Another popular approach replaces the Frobenius norm in (1) with other norms to enhance robustness to outliers. One instance of this approach uses an $\ell_1$-norm relaxation, commonly used for sparse vector estimation, yielding robustness to outliers. A related variant replaces the $\ell_1$-norm with the $\ell_{1,2}$-norm to promote column-sparse solutions. Recently, the idea of using a robust norm was revisited: the non-convex constraint set is relaxed to a larger convex set and exact subspace recovery is guaranteed under certain conditions. Nevertheless, these approaches are not directly applicable to large matrices and high-dimensional data settings. For example, the iterative solver used in that line of work requires an eigen-decomposition operation in each iteration.

An interesting approach for outlier detection was recently 
proposed in p7) , based on the idea that outliers do 
not typically follow low dimensional structures. Hence, few 
outliers cannot form a linearly dependent set. Unlike most 
existing approaches, this approach can recover the correct 
subspace even if a remarkable portion of the data is outliers, 
albeit its computational complexity is roughly 0(A^|) 

Also, the number of samples required by this approach scales linearly with the data dimension, which is quite restrictive in high-dimensional settings.

In this paper, we propose two randomized algorithms for two distinct outlier models. The first algorithm is a new randomized approach that exploits the linear independence of the outlying columns (cf. Section III-A). It is shown that this randomized algorithm can recover the correct subspace with sample complexity that is almost independent of the size of the data. It also imposes less stringent constraints on the distribution of outliers compared to prior work on the independent outlier model.


The second algorithm presented in Section III-B deals with the sparse column outlier model using convex rank minimization on reduced data sketches. Robust PCA using convex rank minimization was first analyzed in prior work, where it was shown that the optimal point of

$$\min_{\hat{L}, \hat{C}} \ \|\hat{L}\|_* + \lambda \|\hat{C}\|_{1,2} \quad \text{subject to} \quad \hat{L} + \hat{C} = D \tag{7}$$

yields the exact subspace and the correct outlier identification provided that $C$ is sufficiently column-sparse. The column-sparsity of $C$ is the main requirement of (7), i.e., a very small fraction of the columns of $C$ can be non-zero. The computational complexity of (7) is roughly $O(r N_1 N_2)$ per iteration and the entire data needs to be saved in the working memory,
which is prohibitive in big data applications. In this paper, we show that the complexity of subspace recovery reduces to $O(r^3 \mu_v)$, which is substantially less than $O(r N_1 N_2)$ for high dimensional data, using a randomized approach that applies (7) to reduced data sketches.


B. Randomized approaches for Robust PCA 

The low rank component $L$ has a low-dimensional structure, and so does $C$ in the element-wise sparse or column-wise sparse models. These low-dimensional structures motivated the use of randomized algorithms for robust PCA using small sketches constructed from random linear measurements of $D$.

However, the majority of such algorithms have focused on robust PCA with the element-wise outlier model. For instance, two randomized methods were proposed in [31] and [29] to recover $L$ from small subsets of the columns and rows of $D$. The randomized approach in [29] was shown to reduce complexity from $O(N_1 N_2 r)$ to $O(\max(N_1, N_2) r^2)$ per iteration.

Randomized approaches for the column-wise outlier model were proposed in [34] and [26]. The algorithm in [26] is built on the assumption that any subset of outlying columns with cardinality less than $N_1$ is linearly independent. The algorithm repeatedly samples $N_1$ data points until a linearly dependent set is found, upon which those columns that do not depend linearly on the other ones are selected as outliers. Since the number of samples scales linearly with the data dimension and the algorithm requires $O(N_1 N_2)$ iterations on average, it can be quite restrictive in high dimensions, especially when a remarkable portion of the data is outliers. Another limitation of [26] emerges from the assumption that any subset of inliers with at least $r$ columns spans the column space of $L$. This may not be true in general, especially with real world data, which often exhibits clustering structures.

The work in [34] considers the column-sparse outlier model. The data is first embedded into a random low-dimensional subspace, then a subset of the columns of the compressed data is selected. The convex program in (7) is then used to locate the outlying columns of the compressed data. The analysis provided in [34] requires roughly $O(r N_2)$ random linear observations for exact outlier detection. In this paper, we show that both the required number of sampled columns and the dimension of the subspace for random embedding are almost independent of the size of the data, and the required number of random linear measurements is shown to be roughly $O(r^2 \mu_v)$.


III. Proposed Approach

In this section, we propose two algorithms for two distinct models of the outlier matrix. In the first model, the independent outlier model, it is assumed that any small subset of outliers is not linearly dependent. The corresponding algorithm is easy to implement and can recover the correct subspace even if more than 90% of the data is outliers. The second model concerns the scenario in which $C$ is column-sparse, yet allows for outliers to be linearly dependent. For both algorithms, we consider two randomized designs, one utilizing random embedding and the other using random row sampling. We provide a full analysis of the sample complexity for the two algorithms based on both randomized designs. The randomized algorithms can provably retrieve the correct subspace with computational and sample complexity that are almost independent of the size of $D$. In this section, we present the algorithms and the key insights underlying the proposed approach along with the statement of the main theorems. A step-by-step analysis is deferred to Sections IV and V.

A. Algorithm 1: randomized approach for the independent 
outlier model 

Algorithm 1 hinges on the assumption that any small subset of outliers is linearly independent, as stated next.

Assumption 1. Any subset of the non-zero columns of $C$ with cardinality equal to $q$ spans a $q$-dimensional subspace that is independent of the column space of $L$.

The requirement on $q$ will be formalized later in the section. The table of Algorithm 1 presents the algorithm with both randomized designs along with the definitions of the used symbols. The only difference is in step 1.2 as RED uses random embedding while RRD uses row sampling.

Algorithm 1 Randomized Robust PCA based on outlier linear independence with both randomized designs
Input: Data matrix $D \in \mathbb{R}^{N_1 \times N_2}$

1. Data Sketching

1.1 Column Sampling: Matrix $S \in \mathbb{R}^{N_2 \times m_1}$ samples $m_1$ columns of $D$ randomly, $D_s = DS$. The columns of $S$ are a set of standard basis vectors. Thus, $D_s \in \mathbb{R}^{N_1 \times m_1}$.

1.2 Row Compression:

If we use RED: Matrix $\Phi \in \mathbb{R}^{m_2 \times N_1}$ is drawn from any distribution satisfying (6). Matrix $\Phi$ projects the sampled columns $D_s$ into a random $m_2$-dimensional subspace, $D_s^\phi = \Phi D_s$. Thus, $D_s^\phi \in \mathbb{R}^{m_2 \times m_1}$.

If we use RRD: The rows of $\Phi \in \mathbb{R}^{m_2 \times N_1}$ are a subset of the standard basis. Matrix $\Phi$ samples $m_2$ rows of the sampled columns, $D_s^\phi = \Phi D_s$. Thus, $D_s^\phi \in \mathbb{R}^{m_2 \times m_1}$.

2. Subspace Learning

2.1 Sampled Outlier Columns Detection: Define $d_i^\phi$ as the $i$-th column of $D_s^\phi$ and $Q_i^\phi$ as $D_s^\phi$ with the $i$-th column removed. Solve the optimization problem (10) for $1 \le i \le m_1$ to identify the outlying columns of $D_s^\phi$ (if the minimum value of (10) is non-zero, the column is an outlier).

2.2 Subspace Learning: Construct $T$ as the set of columns of $D_s$ corresponding to a set of linearly independent inlier columns of $D_s^\phi$ spanning the subspace of the inlier columns of $D_s^\phi$.

Output: The matrix $T$ is a basis for the column space of $L$.


Algorithm 2 Randomized Robust PCA based on outlier matrix column-sparsity with both randomized designs
Input: Data matrix $D \in \mathbb{R}^{N_1 \times N_2}$

1. Data Sketching

Perform steps 1.1 and 1.2 of Algorithm 1.

2. Subspace Learning

2.1 Sampled Outlier Columns Detection: Obtain $\hat{L}_s^\phi$ and $\hat{C}_s^\phi$ as the optimal solution of

$$\min_{\hat{L}_s^\phi, \hat{C}_s^\phi} \ \lambda \|\hat{C}_s^\phi\|_{1,2} + \|\hat{L}_s^\phi\|_* \quad \text{subject to} \quad \hat{L}_s^\phi + \hat{C}_s^\phi = D_s^\phi. \tag{8}$$

The non-zero columns of $\hat{C}_s^\phi$ indicate the location of the outlying columns.

2.2 Subspace Learning: Construct $T$ as the set of columns of $D_s$ corresponding to a set of linearly independent inliers of $D_s^\phi$ spanning the subspace of the inlier columns of $D_s^\phi$.

Output: The matrix $T$ is a basis for the column space of $L$.

Insight: Suppose that $n_s$ columns sampled randomly from $L$ span its column space whp. We do not have direct access to $L$, but assume that the number of sampled data columns, $m_1$, is large enough so that the number of inliers in $D_s$ (the sampled data columns) is at least $(n_s + 1)$ and the number of outliers is less than $q$ whp. In Section IV, it is shown that the sufficient values for $n_s$, $m_1$ and the upper bound $q$ are small and scale linearly with $r$.

According to Assumption 1, if $d_i$ (the $i$-th column of $D_s$) is an inlier, then it must lie in the span of the other columns of $D_s$, which contains at least $(n_s + 1)$ inliers. By contrast, if $d_i$ is an outlier, it would not lie in the span of the other columns since the selected outliers are not linearly dependent. This is the basis for locating the outlying columns of $D_s$.

Algorithm 1 solves a low-dimensional outlier identification problem by projecting the sampled data $D_s$ into a lower-dimensional subspace. Specifically, we form the compressed matrix

$$D_s^\phi = \Phi D_s, \tag{9}$$

where $\Phi \in \mathbb{R}^{m_2 \times N_1}$. The randomized designs differ in the choice of $\Phi$ in (9). Specifically, in RED the matrix $\Phi$ embeds the sampled columns into a random low-dimensional subspace, while in RRD $\Phi$ samples a random subset of the rows of $D_s$ (cf. the table of Algorithm 1).

In order to ensure that (9) preserves the essential information, we derive sufficient conditions to satisfy the following requirement.

Requirement 1. The data sketching has to ensure that:

1. The rank of $\Phi L_s$ is equal to $r$.

2. The non-zero columns of $\Phi C_s$ are independent and they span a subspace independent from the column space of $\Phi L$.

Define $d_i^\phi$ as the $i$-th column of $D_s^\phi$ and $Q_i^\phi$ as $D_s^\phi$ with the $i$-th column removed. In order to locate the outlying columns of $D_s^\phi$, we solve

$$\min_{z} \|d_i^\phi - Q_i^\phi z\|_2 \tag{10}$$

for $1 \le i \le m_1$. If the minimum of (10) is zero (or close to zero for noisy data) for the $i$-th column, it is concluded that the $i$-th column is an inlier; otherwise it is identified as an outlier. Once the outlying columns of $D_s^\phi$ are detected, we can estimate the dimension of the subspace spanned by the inliers of $D_s^\phi$. If the estimated dimension is equal to $\hat{r}$, we find $\hat{r}$ independent inlier columns of $D_s^\phi$. Define $T$ as the matrix formed from the $\hat{r}$ columns of $D_s$ corresponding to these $\hat{r}$ independent inliers of $D_s^\phi$. Thus, if the outlying columns of $D_s^\phi$ are correctly located, $T$ would be a basis for the column space of $L$.
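The detection and subspace learning steps just described can be sketched as follows: each sketched column is regressed on the remaining columns, and a non-zero least-squares residual marks an outlier. This is a minimal sketch, not the authors' implementation. The function names, the tolerances, the greedy selection of independent inliers and the synthetic sizes are all assumptions.

```python
import numpy as np

def num_rank(A, rtol=1e-8):
    s = np.linalg.svd(A, compute_uv=False)
    return int((s > rtol * s[0]).sum())

def detect_outliers(Dphi, tol=1e-8):
    """Flag column i if it is not in the span of the other sketched columns."""
    flags = np.zeros(Dphi.shape[1], dtype=bool)
    for i in range(Dphi.shape[1]):
        d = Dphi[:, i]
        Q = np.delete(Dphi, i, axis=1)
        z, *_ = np.linalg.lstsq(Q, d, rcond=1e-10)   # least-squares problem per column
        flags[i] = np.linalg.norm(d - Q @ z) > tol * max(1.0, np.linalg.norm(d))
    return flags

def inlier_basis(Ds, Dphi, flags):
    """Columns of Ds matching a set of independent inliers of the sketch."""
    inliers = np.flatnonzero(~flags)
    r_hat = num_rank(Dphi[:, inliers])               # estimated subspace dimension
    chosen = []
    for j in inliers:                                # greedy pick (assumption)
        if num_rank(Dphi[:, chosen + [j]]) > len(chosen):
            chosen.append(j)
        if len(chosen) == r_hat:
            break
    return Ds[:, chosen]

rng = np.random.default_rng(3)
N1, r, n_in, n_out, m2 = 60, 3, 30, 5, 20            # assumed sizes
U = np.linalg.qr(rng.standard_normal((N1, r)))[0]
Ds = np.hstack([U @ rng.standard_normal((r, n_in)),
                rng.standard_normal((N1, n_out))])   # inliers first, then outliers
Phi = rng.standard_normal((m2, N1)) / np.sqrt(m2)    # RED with a Gaussian embedding
Dphi = Phi @ Ds

flags = detect_outliers(Dphi)
T = inlier_basis(Ds, Dphi, flags)
```

In this noiseless example the five outliers are linearly independent and the sketch dimension exceeds $r$ plus the number of outliers, so the residual test separates the two groups and `T` spans the column space of the inliers.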

In many applications, we may also be interested in locating the outlying columns. If $T$ spans the column space of $L$, we can easily identify the non-zero columns of $C$ as the non-zero columns of $(I - T(T^T T)^{-1} T^T) D$. If outlier detection is intended, an alternative course for data sketching would be to start with row compression followed by column sampling. This is particularly useful in a distributed network setting, in which each agent sends a compressed version of its data vector to a central processor as opposed to centralizing the entire data. As such, the central unit would work with $D^\phi = \Phi D$. A random subset of the columns of $D^\phi$ is then sampled to form $D_s^\phi$, and subspace learning is applied to $D_s^\phi$ to learn the column space of $\Phi L$. If $U^\phi$ denotes the obtained orthonormal basis for the column space of $\Phi L$, then the non-zero columns of $C$ are identified as the non-zero columns of

$$H = (I - U^\phi (U^\phi)^T) D^\phi. \tag{11}$$
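Once a basis for the column space is available, locating the outliers amounts to checking which columns have a non-zero residual after projection onto that subspace. The sketch below computes $(I - T(T^T T)^{-1}T^T)D$ through a least-squares solve rather than an explicit inverse. The function name, the tolerance and the synthetic data are assumptions for illustration.

```python
import numpy as np

def locate_outliers(D, T, tol=1e-8):
    """Indices of columns of D with a non-zero residual outside span(T)."""
    # T @ coef equals T (T^T T)^{-1} T^T D when T has full column rank.
    coef, *_ = np.linalg.lstsq(T, D, rcond=None)
    resid = np.linalg.norm(D - T @ coef, axis=0)
    scale = max(1.0, np.linalg.norm(D, axis=0).max())
    return np.flatnonzero(resid > tol * scale)

rng = np.random.default_rng(4)
N1, N2, r = 30, 50, 2
U = np.linalg.qr(rng.standard_normal((N1, r)))[0]
D = U @ rng.standard_normal((r, N2))
outliers = [3, 17, 41]
D[:, outliers] = rng.standard_normal((N1, len(outliers)))

T = U @ np.array([[2.0, 1.0], [0.0, 1.0]])   # a non-orthonormal basis of the same subspace
found = locate_outliers(D, T)
```

The same function applies to (11) by passing the sketched data and the learned basis of the sketched subspace in place of `D` and `T`.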

We can readily state the following theorems, which establish 
performance guarantees for Algorithm 1 with both randomized 
designs. 

Theorem 1 (Sufficient Condition-Algorithm 1 with RED). Suppose $D$ follows Data Model 1, Assumption 1 is satisfied, $m_1$ columns are sampled randomly with replacement and any repeated columns are removed. If for fixed $c > 1$ and small $0 < \delta \ll 1/5$,

$$m_1 \ge \beta \alpha \frac{N_2}{N_2'}, \qquad \beta \ge 2 + \frac{3}{\alpha} \log\frac{2}{\delta}, \qquad q = \alpha \frac{\beta K}{N_2'} \left(1 + \frac{1}{c}\right),$$

$$m_2 \ge \max\left( \frac{(r+q)\log(42\sqrt{2}) + \log\frac{2}{\delta}}{f(1/2)},\ \frac{(r+1)\log(42\sqrt{2}) + \log K + \log\frac{2}{\delta}}{f(1/2)} \right), \tag{12}$$

where the $m_2 \times N_1$ embedding matrix $\Phi$ is drawn from any distribution satisfying (6) and

$$\alpha = \max\left\{ 20 \mu_v' r \log\frac{4r}{\delta},\ 3 c^2 \frac{N_2'}{K} \log\frac{2}{\delta} \right\}, \tag{13}$$

then Algorithm 1 with RED yields the exact subspace and (11) identifies the non-zero columns of $C$ correctly with probability at least $1 - 5\delta$.

The following theorem is the counterpart of Theorem 1 with RRD. In this paper, for the analysis of RRD we assume that the non-zero entries of $C$ are sampled from a zero-mean normal distribution.

Theorem 2 (Sufficient Condition-Algorithm 1 with RRD). Suppose $D$ follows Data Model 1, $m_1$ columns are sampled randomly with replacement and any repeated columns are removed, $m_2$ rows are sampled randomly without replacement and the non-zero elements of $C$ are sampled independently from a zero-mean normal distribution. If for fixed $c > 1$ and small $0 < \delta \ll 1/6$, $m_1$, $\beta$ and $q$ follow (12), $\alpha$ is equal to (13) and

$$m_2 \ge \max\left( r \eta_u^2 \max\left(c_1 \log r,\ c_2 \log\frac{3}{\delta}\right),\ r + q + 2\log\frac{2}{\delta} + \sqrt{8 q \log\frac{2}{\delta}},\ r + 1 + 2\log\frac{2K}{\delta} + \sqrt{8 \log\frac{2K}{\delta}} \right), \tag{14}$$

where $c_1$ and $c_2$ are constant numbers, then Algorithm 1 with RRD yields the exact subspace and (11) identifies the non-zero columns of $C$ correctly with probability at least $1 - 6\delta$.

Remark 1. In practice, the number of outliers is smaller than the number of inliers. Therefore, $K/N_2' < 1$ (albeit this is not necessary for Algorithm 1). Suppose that $C \mu_v' r \log\frac{4r}{\delta} \ge \log\frac{2}{\delta}$, where $C$ is a constant number. According to (12), it is almost sufficient to choose $\beta = 2$. Therefore, the sufficient number of randomly sampled columns is $m_1 \ge 4 C \mu_v' r \log\frac{4r}{\delta}$, i.e., $m_1$ scales linearly with $r \mu_v' \log 4r$. The number of sampled outliers is $O(m_1 K / N_2)$. Thus, the sufficient value for $m_2$ for Algorithm 1 is $O(\max(r, m_1 K / N_2))$ with RED and $O(\max(r \eta_u^2, m_1 K / N_2))$ with RRD.

B. Algorithm 2: randomized approach for the column-sparse outlier model

The table of Algorithm 2 details the randomized approach based on the column-sparsity of $C$ with both randomized designs. Algorithm 2 differs from Algorithm 1 in the subspace learning step since we do not assume that the outliers are linearly independent. Instead, subspace learning relies on the column-sparsity of $C$ and the convex algorithm (7) is used in the subspace learning step. This implies a different requirement for the row compression step, stated as follows.

Requirement 2. The data sketching has to ensure that:

1. The rank of $\Phi L_s$ is equal to $r$.

2. The non-zero columns of $\Phi C_s$ do not lie in the column space of $\Phi L$.

It is worth noting that the randomized approach substantially reduces the complexity of (7). If (7) is applied directly to $D$, the complexity would be $O(N_1 N_2 r)$ per iteration. With the randomized approach, we show that the complexity of the subspace learning step is almost independent of the size of the data. The following theorems establish performance guarantees for Algorithm 2 with both randomized designs.

Theorem 3 (Sufficient Condition-Algorithm 2 with RED). Suppose $D$ follows Data Model 1, the matrix $\Phi$ is drawn from any distribution satisfying (6), and the columns of $D_s$ are sampled randomly with replacement. If for small $0 < \delta \ll 1/3$,

    No No   5 > ^ (1-f 6r/x„(121/9)),   K ^ -(l + 6rM„(121/9))   K ~ 5 (l + 6r^„(121/9)) ’   (15)

$$m_2 \ge \frac{(r+1)\log(42\sqrt{2}) + \log K + \log\frac{2}{\delta}}{f(1/2)},$$

where

    Q = max   No   2r   (16)

then Algorithm 2 with RED recovers the exact subspace and (11) correctly identifies the non-zero columns of $C$ with probability at least $1 - 3\delta$.

Theorem 4 (Sufficient Condition-Algorithm 2 with RRD). Suppose $D$ follows Data Model 1, the columns of $D_s$ are sampled randomly with replacement, and the rows are sampled randomly without replacement. In addition, it is assumed that the non-zero elements of $C$ are sampled independently from a zero-mean normal distribution. If for $0 < \delta \ll 1/4$, $m_1$, $g$ and $K/N_2$ follow (15), $\zeta$ is equal to (16) and

$$m_2 \ge \max\left( r \eta_u^2 \max\left(c_1 \log r,\ c_2 \log\frac{3}{\delta}\right),\ r + 1 + 2\log\frac{2K}{\delta} + \sqrt{8 \log\frac{2K}{\delta}} \right), \tag{17}$$

then Algorithm 2 with RRD recovers the exact subspace and (11) correctly identifies the non-zero columns of $C$ with probability at least $1 - 4\delta$.

Remark 2. If we choose

    5 = 2^(l + 6r^„(121/9))   N'   I\2   (18)

then the sufficient conditions (15) can be rewritten as

    K ^ N2/2N'^   iV 2 - l + 6r/i„(121/9)

    mi>max^l2^(l+6r^„(121/9))^og ^ , 10r^„logy^

Thus, $m_1$ for Algorithm 2 is 0(max(r/j,y, According to (15) and (17), the sufficient value for $m_2$ is roughly $O(r)$ with RED and $O(r \eta_u^2)$ with RRD. In addition, the permissible number of outliers scales linearly with $N_2$, i.e., it is not restricted to a sublinear sparsity regime.


IV. Analysis of Algorithm 1

In this section, we provide a step-by-step analysis of Algorithm 1. The proofs of the main theorems, the lemmas and the intermediate results are deferred to the appendix. First, we establish a sufficient condition on the number of sampled columns $m_1$ to guarantee that each inlier of $D_s$ lies in the span of the other inliers of $D_s$. Based on the number of sampled columns, we readily obtain an upper bound on the number of outlying columns in $D_s$. Then, we derive a sufficient condition for $m_2$ to satisfy Requirement 1.

A. Random sampling from low rank matrices

In the randomized approach, the column space of $L$ is learned from a small random subset of the columns of $D$. Therefore, we first have to ensure that the selected inliers span the column space of $L$. Initially, let us assume that $L' \in \mathbb{R}^{N_1 \times N_2'}$ is given. Suppose that $L' = U' \Sigma' (V')^T$ is the compact SVD of $L'$, where $U' \in \mathbb{R}^{N_1 \times r}$, $V' \in \mathbb{R}^{N_2' \times r}$ and $\Sigma' \in \mathbb{R}^{r \times r}$. The following lemma establishes a sufficient condition for a random subset of the columns of a low rank matrix to span its column space.

Lemma 5. Suppose $n_s$ columns are sampled uniformly at random with replacement from the matrix $L'$ with rank $r$. If

$$n_s \ge 10 \mu_v' r \log\frac{2r}{\delta}, \tag{20}$$

then the selected columns of the matrix $L'$ span the column space of $L'$ with probability at least $(1 - \delta)$.

Hence, the column space of a low rank matrix $L'$ can be captured from a small random subset of its columns when its row space is incoherent with the standard basis.

B. Random column sampling from data matrix D

Let $\alpha = 20 \mu_v' r \log\frac{4r}{\delta}$. Based on Lemma 5, the inliers in $D_s$ span the column space of $L$ and each inlier of $D_s$ lies in the span of the rest of the inliers of $D_s$ whp if the number of inliers in $D_s$ is at least $\alpha$. Suppose we sample $m_1 = \beta \alpha N_2 / N_2'$ data columns randomly from $D$, where $\beta > 1$. The following lemma provides a sufficient condition on $\beta$ to ensure that the number of selected inliers exceeds $\alpha$.

Lemma 6. Suppose that $m_1 = \beta \alpha N_2 / N_2'$ columns of the given data matrix are sampled uniformly at random with replacement. If

$$\beta \ge 2 + \frac{3}{\alpha} \log\frac{2}{\delta}, \tag{21}$$

then the number of inlier columns of $D_s$ is greater than or equal to $\alpha$ with probability at least $(1 - \delta)$.

According to (21), it is almost sufficient to choose $\beta = 2$. In addition, in most applications, $N_2 / N_2' < 2$. Therefore, if $4\alpha$ columns are sampled at random, the sampled columns will contain at least $\alpha$ randomly sampled inliers.
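A quick simulation illustrates the column sampling step: drawing $m_1 = \beta\alpha N_2/N_2'$ columns uniformly at random with replacement yields $\beta\alpha$ inliers on average, so with $\beta = 2$ the inlier count rarely falls below $\alpha$. The sizes and the value of $\alpha$ below are assumptions chosen for illustration, and repeated columns are discarded before counting.

```python
import numpy as np

rng = np.random.default_rng(5)
N2, K = 10_000, 4_000                  # total columns and outliers (assumed)
N2_in = N2 - K                         # number of inliers N2'
alpha, beta = 60, 2.0                  # assumed target inlier count and oversampling
m1 = int(np.ceil(beta * alpha * N2 / N2_in))

is_inlier = np.arange(N2) >= K         # the first K columns act as outliers
trials = 2000
counts = np.empty(trials, dtype=int)
for t in range(trials):
    idx = np.unique(rng.integers(0, N2, size=m1))   # with replacement, repeats removed
    counts[t] = is_inlier[idx].sum()

success_rate = np.mean(counts >= alpha)
mean_outliers = m1 * K / N2            # expected number of sampled outliers
```

Here $m_1 = 200$ regardless of $N_2$, and only about $m_1 K/N_2 = 80$ of the 4000 outliers enter the sketch on average, which is the reduction in the number of outlying columns that the randomized approach relies on.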


C. Selected outlying columns 

The advantage of column sampling in the randomized approach is two-fold. First, complexity is substantially reduced since we only need to process a small subset of the data. Second, the number of outliers in $D_s$ is significantly smaller than the total number of outliers, which in turn relaxes the requirement on the spark of $C$ considerably. To clarify, robust PCA algorithms built on the linear independence assumption of the outlier columns, such as [26], require every subset of outliers with cardinality less than $(N_1 + 1)$ to be independent. In contrast, Algorithm 1 only requires independence for significantly smaller subsets of selected outliers. The following lemma establishes an upper bound on the number of selected outliers.

Lemma 7. Suppose that $m_1 = \beta \alpha N_2 / N_2'$ columns of the given data matrix are sampled uniformly at random with replacement. If

$$\alpha \ge 3 c^2 \frac{N_2'}{K} \log\frac{2}{\delta}, \tag{22}$$

then the number of outliers selected is bounded from above by

$$q = \alpha \frac{\beta K}{N_2'} \left(1 + \frac{1}{c}\right) \tag{23}$$

with probability at least $(1 - \delta)$, where $c$ is any number greater than 1.


D. Row compression 

In this section, we establish sufficient conditions on $m_2$ to satisfy Requirement 1. Suppose $D_s$ contains $k$ outlying columns. Thus, given Assumption 1, the rank of $D_s$ is equal to $r + k$. Requirement 1 is clearly satisfied if the rank of $\Phi D_s$ is equal to the rank of $D_s$. The following lemmas provide sufficient conditions for $m_2$ with both randomized designs.


Lemma 8. Suppose $D_s$ contains at most $q$ outlying columns and assume that $\Phi$ is an $m_2 \times N_1$ matrix satisfying the distributional JL property with

$$m_2 \ge \frac{(r+q)\log(42\sqrt{2}) + \log\frac{2}{\delta}}{f(1/2)}. \tag{24}$$

Then, the rank of $\Phi D_s$ is equal to the rank of $D_s$ with probability at least $(1 - \delta)$.
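The rank-preservation claim for random embedding can be illustrated numerically: a sketch dimension proportional to $r + q$, independent of the ambient dimension $N_1$, already preserves the rank of the sampled columns. The example assumes a Gaussian embedding and illustrative sizes, and the factor 2 in `m2` is an arbitrary choice rather than the constant of the lemma.

```python
import numpy as np

def num_rank(A, rtol=1e-8):
    s = np.linalg.svd(A, compute_uv=False)
    return int((s > rtol * s[0]).sum())

rng = np.random.default_rng(6)
N1, r, q, n_in = 2000, 5, 8, 40                       # assumed sizes
U = np.linalg.qr(rng.standard_normal((N1, r)))[0]
Ds = np.hstack([U @ rng.standard_normal((r, n_in)),   # inliers in an r-dim subspace
                rng.standard_normal((N1, q))])        # q independent outliers

m2 = 2 * (r + q)                                      # far smaller than N1
Phi = rng.standard_normal((m2, N1)) / np.sqrt(m2)     # Gaussian embedding (assumption)

rank_before = num_rank(Ds)
rank_after = num_rank(Phi @ Ds)
```

Both ranks equal $r + q = 13$ here even though the sketch has only 26 rows instead of 2000.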
Lemma 9. Suppose $D_s$ contains at most $q$ outlying columns, the rank of its low rank component is equal to $r$, the non-zero elements of $C$ are sampled independently from a zero-mean normal distribution, and the rows of $D_s^\phi$ are $m_2$ randomly sampled (without replacement) rows of $D_s$. If

$$m_2 \ge \max\left( r \eta_u^2 \max\left(c_1 \log r,\ c_2 \log\frac{3}{\delta}\right),\ r + q + 2\log\frac{2}{\delta} + \sqrt{8 q \log\frac{2}{\delta}} \right), \tag{25}$$

where $c_1$ and $c_2$ are constant numbers, then the rank of $D_s^\phi$ is equal to the rank of $D_s$ with probability at least $1 - 2\delta$.



V. Analysis of Algorithm 2 


Similar to the analysis of Algorithm 1 in Section IV, we can make use of Lemma 5 to derive a sufficient condition on $m_1$ to ensure that the rank of $L_s$ is equal to the rank of $L$. The number of selected outliers can also be bounded in a similar way. Prior analysis established that (7) yields exact outlier identification if $C$ is sufficiently column-sparse. If $C$ is sufficiently sparse, $C_s$ is also a column-sparse matrix whp. Thus, we just need to ensure that $D_s^\phi$ is a representative data sketch with sufficient information. The following lemmas establish sufficient conditions on $m_2$ for the row compression step to satisfy Requirement 2 for both RED and RRD.


Lemma 10. Suppose $\mathbf{D}_s$ contains at most $q$ outlying columns and assume that $\mathbf{\Phi}$ is an $m_2 \times N_1$ matrix satisfying the distributional JL property with

$$ m_2 \ge \frac{(r + 1)\log(42\sqrt{2}) + \log q + \log\frac{2}{\delta}}{f(1/2)}. \tag{26} $$

Then, Requirement 2 is satisfied with probability at least $1 - \delta$.


Lemma 11. Suppose the rank of $\mathbf{L}_s$ is equal to $r$, $\mathbf{D}_s$ contains at most $q$ outlying columns, the non-zero elements of $\mathbf{C}$ are sampled independently from a zero-mean normal distribution and the rows of $\mathbf{D}_s^{\phi}$ are $m_2$ randomly sampled (without replacement) rows of $\mathbf{D}_s$. If

$$ m_2 \ge \max\left( r\eta_u \max\left(c_1 \log r,\; c_2 \log\frac{3}{\delta}\right),\; r + 1 + 2\log\frac{2q}{\delta} + \sqrt{8\log\frac{2q}{\delta}} \right), \tag{27} $$

then Requirement 2 is satisfied with probability at least $1 - 2\delta$.


VI. RED versus RRD and Complexity Analysis

While the row compression step for RED has computational complexity $O(m_1 m_2 N_1)$ if we start data sketching with column sampling or $O(m_2 N_1 N_2)$ if we start data sketching with row compression, this step incurs no computational complexity in RRD. Hence, RRD may be more favorable for big data due to its reduced computational complexity. However, concerning sample complexity, random embedding is generally a more effective data sketching tool since the random projection matrix is not coherent with the data. To clarify, consider the extreme scenario where $r = 2$, $\mathbf{L} \in \mathbb{R}^{2000 \times 2000}$ and only two rows of $\mathbf{L}$ are non-zero. In this scenario, one needs to sample more or less all of the rows to ensure that the rank of $\mathbf{L}^{\phi} = \mathbf{\Phi}\mathbf{L}$ is equal to 2, i.e., $m_2$ has to be equal to 2000. In contrast, projecting the data into a random subspace with dimension equal to 2 is almost sufficient to ensure that the rank of $\mathbf{L}^{\phi}$ is equal to 2 whp, i.e., $m_2 = 2$ is nearly sufficient. As another example, consider a matrix $\mathbf{G} \in \mathbb{R}^{2000 \times 3000}$ generated by concatenating the columns of matrices $\mathbf{G}_i$, $i = 1, \dots, n$, as $\mathbf{G} = [\mathbf{G}_1 \; \mathbf{G}_2 \; \dots \; \mathbf{G}_n]$ and assume that $\mathbf{L} = \mathbf{G}^T$.
Eor 1 < i < f, G, = U,Q, , where U, G K^ooox^^ 

Qj G Eor n/2 -\- 1 < i < n, Gi = , where 

U- g k 2000 x^^ Q., g The elements of U., and Q,, 

are sampled independently from a normal $\mathcal{N}(0, 1)$ distribution. The parameter $r$ is set equal to 50, thus, the rank of $\mathbf{L}$ is equal to 50 whp. Accordingly, the rows of $\mathbf{L}$ lie in a union of low-dimensional subspaces and if $n > 1$, the distribution of the rows of $\mathbf{L}$ in the row space of $\mathbf{L}$ will be highly non-uniform. Fig. 1 shows the rank of $\mathbf{\Phi}\mathbf{L}$ versus $m_2$. When $n = 1$, the rows of $\mathbf{L}$ are distributed uniformly at random in the row space of $\mathbf{L}$. Thus, $r$ rows sampled uniformly at random are enough to span the row space of $\mathbf{L}$. But, when $n = 50$, we need to sample almost 500 rows at random to span the row space. On the other hand, embedding the data into a random subspace with dimension 50 is almost sufficient to preserve the rank of $\mathbf{L}$ even if $n = 50$.
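The first (extreme) scenario above can be illustrated with a small, self-contained sketch. It is not the authors' code: the matrix is scaled down to 200 x 200 to keep the run short, and the helper names (`matrix_rank`, `two_row_matrix`, `sample_rows`, `random_embedding`) are hypothetical. The sketch compares the rank retained by row sampling with the rank retained by a Gaussian random embedding when only two rows of the matrix are non-zero.

```python
import random


def matrix_rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [list(r) for r in rows]
    n_rows = len(a)
    n_cols = len(a[0]) if a else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda i: abs(a[i][col]))
        if abs(a[pivot][col]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for i in range(rank + 1, n_rows):
            factor = a[i][col] / a[rank][col]
            for j in range(col, n_cols):
                a[i][j] -= factor * a[rank][j]
        rank += 1
    return rank


def two_row_matrix(n, rng):
    """n x n matrix whose only non-zero rows are rows 0 and 1 (rank 2 whp)."""
    mat = [[0.0] * n for _ in range(n)]
    mat[0] = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mat[1] = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return mat


def sample_rows(mat, m2, rng):
    """Row sampling (RRD-style): keep m2 rows chosen without replacement."""
    return [mat[i] for i in rng.sample(range(len(mat)), m2)]


def random_embedding(mat, m2, rng):
    """Random embedding (RED-style): multiply by an m2 x n Gaussian matrix."""
    n = len(mat)
    phi = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m2)]
    return [[sum(phi[i][k] * mat[k][j] for k in range(n))
             for j in range(len(mat[0]))] for i in range(m2)]


if __name__ == "__main__":
    rng = random.Random(0)
    L = two_row_matrix(200, rng)
    print("rank after sampling 2 rows:  ", matrix_rank(sample_rows(L, 2, rng)))
    print("rank after 2-dim embedding:  ", matrix_rank(random_embedding(L, 2, rng)))
```

Row sampling retains the rank only if both non-zero rows happen to be selected, whereas the embedding mixes all rows, which is the coherence argument made in the text.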

A. Computational complexity analysis 

The randomized approach consists of three steps: data sketching, subspace recovery and outlier detection. The data sketching step for RED has computational complexity $O(m_1 m_2 N_1)$ if data sketching starts with column sampling and $O(m_2 N_1 N_2)$ if it starts with row compression. Yet, this step has little impact on the actual run-time of the algorithms as it only involves a basic matrix multiplication operation for data embedding. Data sketching incurs no computational complexity in RRD. The complexity of subspace recovery
is roughly 0(m\m2) and 0(rmim2) for Algorithms 1 and 
2, respectively. The outlier detection step ( [TT| l has complex¬ 
ity 0{m\N2). As subspace learning and outlier detection 
(if intended) dominate the run-time of the algorithms, the randomized approach brings about substantial speedups in comparison to approaches that use the full-scale data. This is so given that the sufficient values for $m_1$ and $m_2$ are almost independent of the size of the data (cf. Section III), hence the randomized approach evades solving high-dimensional optimization problems. In contrast, solving the corresponding full-scale optimization problem, for example, has complexity $O(r N_1 N_2)$ per iteration. Table II compares the run time of Algorithm 2 to the corresponding non-randomized approach with outlier detection. In this example, $r = 20$, $m_1 = 400$ and $m_2 = 100$. The randomized approach (even using RED) is remarkably faster than the non-randomized approach.

TABLE II
Running time of randomized Algorithm 2 with outlier detection and of the corresponding non-randomized algorithm.

Data size    Algorithm 2 RED        Algorithm 2 RRD        Non-randomized
             + outlier detection    + outlier detection    algorithm
1000         0.5 s                  0.5 s                  30 s
5000         0.6 s                  0.6 s                  450 s
10000        1 s                    0.6 s                  2500 s
20000        2 s                    0.7 s                  12000 s
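To make the size of the gap concrete, the following sketch computes the speedup factors implied by the run times reported in Table II. The dictionary keys (`red`, `rrd`, `full`) and the function name are hypothetical labels introduced here; only the timings come from the table.

```python
# Run times in seconds, transcribed from Table II.
# "full" denotes the non-randomized algorithm (label chosen for this sketch).
TABLE_II = {
    1000: {"red": 0.5, "rrd": 0.5, "full": 30.0},
    5000: {"red": 0.6, "rrd": 0.6, "full": 450.0},
    10000: {"red": 1.0, "rrd": 0.6, "full": 2500.0},
    20000: {"red": 2.0, "rrd": 0.7, "full": 12000.0},
}


def speedups(table):
    """Ratio of the non-randomized run time to each randomized run time."""
    return {
        size: {design: row["full"] / row[design] for design in ("red", "rrd")}
        for size, row in table.items()
    }


if __name__ == "__main__":
    for size, ratios in speedups(TABLE_II).items():
        print(f"size {size:>6}: RED x{ratios['red']:.0f}, RRD x{ratios['rrd']:.0f}")
```

The ratios grow with the data size because the randomized run times stay nearly flat while the non-randomized run time increases steeply.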


VII. Noisy data 

In practice, noisy data can be modeled as

$$ \mathbf{D} = \mathbf{L} + \mathbf{C} + \mathbf{N}, \tag{28} $$

where $\mathbf{N}$ is an additive noise component. It was shown that the optimal point of

$$ \min_{\mathbf{L}, \mathbf{C}} \; \lambda\|\mathbf{C}\|_{1,2} + \|\mathbf{L}\|_* \quad \text{subject to} \quad \|\mathbf{L} + \mathbf{C} - \mathbf{D}\|_F \le \epsilon_n \tag{29} $$

is equal to the optimal point of the noise-free convex program with an error proportional to the noise level. The parameter $\epsilon_n$ has to be chosen based on the noise level. This modified version can be used in Algorithm 2 to account for the presence of noise.

Recall that Algorithm 1 is built on the idea that outliers of $\mathbf{D}_s^{\phi}$ cannot be constructed from, or well-approximated by, linear combinations of the other columns of $\mathbf{D}_s^{\phi}$. In the presence of noise, we further need to ensure that an outlier cannot be obtained from linear combinations of the columns of $\mathbf{N}_s^{\phi} = \mathbf{\Phi}\mathbf{N}\mathbf{S}$. If an outlier lies in the span of the columns of $\mathbf{N}_s^{\phi}$, the coefficients in the linear combinations of the columns of $\mathbf{N}_s^{\phi}$ would have to be fairly large given that the columns of $\mathbf{N}$ have small Euclidean norm. Thus, to make Algorithm 1 robust to noise, we add a constraint to (10) as follows

min l|d^5 - s.t. ||z||p < w , (30) 

Z 


where $p \ge 1$ and $\omega$ is adjusted w.r.t. the noise level.


VIII. Numerical Simulations 

In this section, we present some numerical experiments to 
study the requirements and performance of the randomized 
approach. The numerical results confirm that the sample 
complexity of the randomized methods is almost independent 
of the size of the data. First, we investigate different scenarios
using synthetic data. Then, the performance and requirements 
of the randomized algorithms are examined with real data. 


A. Phase transition plots with synthetic data 

In this section, the low rank matrix is generated as a product $\mathbf{L} = \mathbf{U}_r \mathbf{V}_r^T$, where $\mathbf{U}_r \in \mathbb{R}^{N_1 \times r}$ and $\mathbf{V}_r \in \mathbb{R}^{N_2 \times r}$. The elements of $\mathbf{U}_r$ and $\mathbf{V}_r$ are sampled independently from a standard normal $\mathcal{N}(0, 1)$ distribution. The columns of $\mathbf{C}$ are non-zero independently with probability $p$. Thus, the expected value of the number of outlier columns is $pN_2$. The non-zero entries of $\mathbf{C}$ are sampled independently from $\mathcal{N}(0, 20^2)$. The phase transition plots show the probability of correct subspace recovery for the pairs of $(m_1, m_2)$. White designates exact subspace recovery and black indicates incorrect recovery. In all experiments presented in this section the data is a $2000 \times 4000$ matrix except for the simulation in Fig. 6.
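The synthetic data model described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function name `generate_data` and its parameters are hypothetical, the data is formed as D = L + C without noise, and the low rank part is left in place underneath the outlier columns.

```python
import random


def generate_data(n1, n2, r, p, outlier_std=20.0, seed=0):
    """Return (D, L, C, is_outlier) for an n1 x n2 synthetic data matrix.

    L = U_r V_r^T with i.i.d. N(0, 1) factors; each column of C is non-zero
    with probability p, with i.i.d. N(0, outlier_std^2) entries.
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n1)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n2)]
    L = [[sum(U[i][k] * V[j][k] for k in range(r)) for j in range(n2)]
         for i in range(n1)]
    # Decide independently for every column whether it is an outlier.
    is_outlier = [rng.random() < p for _ in range(n2)]
    C = [[rng.gauss(0.0, outlier_std) if is_outlier[j] else 0.0
          for j in range(n2)] for i in range(n1)]
    D = [[L[i][j] + C[i][j] for j in range(n2)] for i in range(n1)]
    return D, L, C, is_outlier


if __name__ == "__main__":
    # Scaled-down instance; the paper uses a 2000 x 4000 matrix.
    D, L, C, is_outlier = generate_data(n1=50, n2=100, r=5, p=0.2)
    print("outlier columns:", sum(is_outlier), "expected about", 0.2 * 100)
```

With these settings the number of outlier columns fluctuates around $pN_2$, as stated in the text.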

Fig. 2 shows the phase transition of Algorithm 1 with RED for different values of $r$. When $r$ is increased, the required values of $m_1$ and $m_2$ increase as we need more samples to ensure that the selected columns span the column space of $\mathbf{L}$, as well as a higher dimension for the embedding subspace given that the column space of $\mathbf{L}$ has a higher dimension. Fig. 3 shows a similar plot with RRD. Since in this section the columns/rows of $\mathbf{L}$ are distributed uniformly at random in the column/row-space of $\mathbf{L}$, RED and RRD yield a similar performance. As such, for the remaining scenarios in this section we only provide phase transitions with RED (RRD yields the same performance).

Fig. 4 illustrates the phase transition for Algorithm 1 with RED for different values of $p$. Increasing $p$ has only minimal effect on $m_1$ (which is around 25) because the required number of sampled columns depends linearly on $r/(1-p)$. Therefore, when $p$ is increased from 0.2 to 0.7, $r/(1-p)$ increases from $1.25r$ to $3.3r$. It is interesting to observe that when the number of sampled columns is increased, the required $m_2$ also increases. This is due to the fact that the number of sampled outlier columns increases as we sample more columns. Subsequently, the selected outliers span a subspace with a higher dimension, wherefore we need a random subspace with higher dimension for embedding the sampled columns to ensure that the rank of $\mathbf{\Phi}\mathbf{D}_s$ is equal to the rank of $\mathbf{D}_s$ in Algorithm 1.

The phase transition plots for Algorithm 2 with RED are shown in Fig. 5 for different values of $p$ and $r$. In the left plot, $r = 5$ and $p = 0.01$. With $m_1 > 100$ and $m_2 > 50$, the algorithm yields correct output whp. In the middle plot, the rank is increased to 10. Thus, the required values for $m_1$ and $m_2$ increase. In the right plot, $p = 0.2$ and Algorithm 2 cannot yield correct subspace recovery since the convex program requires $\mathbf{C}$ to be column-sparse (roughly requiring $p < 0.05$).

Fig. 6 shows the phase transition of Algorithm 1 with RED for data matrices with different dimensions. Although the size of the data is increased from $2000 \times 4000$ to $(5 \times 10^4) \times 10^5$, the required values for $m_1$ and $m_2$ remain unchanged, confirming our analysis, which revealed that the sample complexity of the proposed approach is almost independent of the size of the data. In this simulation, since the columns/rows are distributed randomly, the column space and row space of $\mathbf{L}$ have small incoherence parameters. Thus the factors dominating the sample complexity are $r$ and $p$.
Fig. 2. Phase transition plots of Algorithm 1 with RED (panels: r = 20, 45, 70, each with p = 0.2).

Fig. 3. Phase transition plots of Algorithm 1 with RRD (panels: r = 20, 45, 70, each with p = 0.2).

Fig. 4. Phase transition plots of Algorithm 1 with RED (panels: p = 0.2, 0.5, 0.7, each with r = 5).

Fig. 5. Phase transition plots of Algorithm 2 with RED (panels: r = 5, p = 0.01; r = 10, p = 0.01; r = 5, p = 0.2).

Fig. 10. The dimension of $\mathbf{U}^{\phi}$ and $h$ versus the value of $m_2$.


B. Phase transition with real data 

In this section, we study the requirements of the randomized approach with real data for motion tracking and segmentation. The data is generated by extracting and tracking a set of points throughout the frames [35]. The data is a low rank matrix, and the motion data points lie in a union of low-dimensional subspaces. We use one of the scenarios in Hopkins155 [35]. This data matrix is 62 x 464 and its rank is roughly equal to 4. We add 50 outlying data points. Thus, the final data is 62x512. Fig. 7 is the phase transition of Algorithm 1 with RED and RRD showing the probability of correct outlier identification. When $m_1$ and $m_2$ are greater than 10, the algorithm yields exact outlier detection whp.


Fig. 6. Phase transition plots of Algorithm 1 with RED (r = 20, p = 0.2; panels: N_1 = N_2/2 = 2000, 20000, 50000).

Fig. 7. Phase transition plots of Algorithm 1 with both RED and RRD applied to motion tracking data.


C. Sufficient values for m 2 with face images 


Vectorized images are high dimensional data vectors. Thus, if they lie in low dimensional subspaces, substantial reductions in computational complexity and memory requirements can be achieved through the row compression operation of the randomized approach. In this experiment, we use the face images in the Extended Yale Face Database B as inlier data points. Fig. 8 displays a random subset of these faces. This database consists of face images from 38 human subjects, and the images of each subject lie in a low-dimensional subspace [36]. According to our investigations, the dimension of the face images (38 faces) is roughly equal to 33. We randomly sample 350 images of the Caltech101 database [37] as outlying data points. Fig. 9 displays a randomly chosen set of the images in the Caltech101 database. Define $\mathbf{U}$ as a basis for the subspace of the faces, $\mathbf{U}^{\phi} = \mathbf{\Phi}\mathbf{U}$ and


as 


= (I - 


(31) 
In addition, define




(32) 


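Thus, by measuring the dimension of the span of $\mathbf{U}^{\phi}$ and the value of $h$, we can observe if the row compression operation preserves the essential information since the dimension of $\mathbf{U}^{\phi}$ is the rank of the low rank component and $h$ is proportional to the norm of the components of the outlying data points which do not lie in the column space of the low rank component. Fig. 10 shows the dimension of $\mathbf{U}^{\phi}$ and the values of $h$ versus $m_2$ for both random embedding and random row sampling. Although the dimension of the data vectors is 32256, it is shown that 300 random linear measurements of the data vectors are nearly sufficient to preserve the rank of $\mathbf{L}$ and the outlying component of $\mathbf{C}$.

IX. Appendix

Proof of Lemma 5

The matrix of sampled columns can be represented as

$$ \mathbf{L}'_s = \mathbf{L}'\mathbf{S}', \tag{33} $$

where $\mathbf{S}' \in \mathbb{R}^{N'_2 \times n_s}$ selects the columns to sample. Using the SVD of $\mathbf{L}'$, (33) can be rewritten as

$$ \mathbf{L}'_s = \mathbf{U}'\mathbf{\Sigma}'(\mathbf{V}')^T\mathbf{S}'. \tag{34} $$

Therefore, if the matrix $(\mathbf{V}')^T\mathbf{S}'$ is full rank, the selected columns of $\mathbf{L}'$ span its column space.

Define $\mathbf{s}'_i$ as the $i$-th column of $\mathbf{S}'$. The vector $\mathbf{s}'_i$ can be any of the vectors of the standard basis with equal probability since we are using random sampling with replacement. Therefore,

$$ \mathbb{E}\left[\mathbf{s}'_i(\mathbf{s}'_i)^T\right] = \frac{1}{N'_2}\mathbf{I}, \tag{35} $$

$$ \mathbb{E}\left[(\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}' - \frac{1}{N'_2}\mathbf{I}\right] = \mathbf{0}. \tag{36} $$

The matrix $(\mathbf{V}')^T\mathbf{S}'(\mathbf{S}')^T\mathbf{V}'$ can be written as

$$ (\mathbf{V}')^T\mathbf{S}'(\mathbf{S}')^T\mathbf{V}' = \sum_{i=1}^{n_s} (\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}'. \tag{37} $$

If $(\mathbf{V}')^T\mathbf{S}'(\mathbf{S}')^T\mathbf{V}'$ is a full rank matrix, then $(\mathbf{V}')^T\mathbf{S}'$ is also full rank. In addition, if we can show that

$$ \left\| (\mathbf{V}')^T\mathbf{S}'(\mathbf{S}')^T\mathbf{V}' - \frac{n_s}{N'_2}\mathbf{I} \right\| \tag{38} $$

is sufficiently small, we can conclude that $(\mathbf{V}')^T\mathbf{S}'(\mathbf{S}')^T\mathbf{V}'$ is full rank. According to (35) and (36), the matrix $(\mathbf{V}')^T\mathbf{S}'(\mathbf{S}')^T\mathbf{V}' - \frac{n_s}{N'_2}\mathbf{I}$ is a sum of $n_s$ independent zero-mean random matrices. Thus, we use the non-commutative Bernstein Inequality [38] to bound the spectral norm in (38).

Lemma 12 (Non-commutative Bernstein Inequality [38]). Let $\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_L$ be independent zero-mean random matrices of dimension $d_1 \times d_2$. Suppose $\rho_k^2 = \max\{\|\mathbb{E}[\mathbf{X}_k\mathbf{X}_k^T]\|, \|\mathbb{E}[\mathbf{X}_k^T\mathbf{X}_k]\|\}$ and $\|\mathbf{X}_k\| \le M$ almost surely for all $k$. Then for any $\tau > 0$

$$ \mathbb{P}\left[ \left\| \sum_{k=1}^{L} \mathbf{X}_k \right\| > \tau \right] \le (d_1 + d_2) \exp\left( \frac{-\tau^2/2}{\sum_{k=1}^{L}\rho_k^2 + M\tau/3} \right). \tag{39} $$

In our problem, $\mathbf{X}_i = (\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}' - \frac{1}{N'_2}\mathbf{I}$. If the matrices $\mathbf{A}$ and $\mathbf{B}$ are positive definite, then $\|\mathbf{A} - \mathbf{B}\| \le \max\{\|\mathbf{A}\|, \|\mathbf{B}\|\}$. Thus, we can derive $M$ as follows

$$ \left\| (\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}' - \frac{1}{N'_2}\mathbf{I} \right\| \le \max\left\{ \left\|(\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}'\right\|, \frac{1}{N'_2} \right\} \le \frac{r\mu_v}{N'_2}. \tag{40} $$

We also have

$$ \left\| \mathbb{E}\left[ \left((\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}' - \frac{1}{N'_2}\mathbf{I}\right) \left((\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}' - \frac{1}{N'_2}\mathbf{I}\right) \right] \right\| = \left\| \mathbb{E}\left[ (\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}'(\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}' \right] - \frac{1}{(N'_2)^2}\mathbf{I} \right\| $$

$$ \le \max\left\{ \left\| \mathbb{E}\left[ (\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}'(\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}' \right] \right\|, \frac{1}{(N'_2)^2} \right\} \le \max\left\{ \frac{r\mu_v}{N'_2} \left\| \mathbb{E}\left[ (\mathbf{V}')^T\mathbf{s}'_i(\mathbf{s}'_i)^T\mathbf{V}' \right] \right\|, \frac{1}{(N'_2)^2} \right\} \le \frac{r\mu_v}{(N'_2)^2}. $$

Therefore, according to Lemma 12 if we set

$$ n_s \ge \frac{28}{3}\, \mu_v\, r \log\frac{2r}{\delta}, \tag{41} $$

then,

$$ \mathbb{P}\left[ \left\| (\mathbf{V}')^T\mathbf{S}'(\mathbf{S}')^T\mathbf{V}' - \frac{n_s}{N'_2}\mathbf{I} \right\| > \frac{n_s}{2N'_2} \right] \le \delta. \tag{42} $$

If $\sigma_1$ and $\sigma_r$ denote the largest and smallest singular values of $(\mathbf{S}')^T\mathbf{V}'$, respectively, then

$$ \frac{n_s}{2N'_2} \le \sigma_r^2 \le \sigma_1^2 \le \frac{3n_s}{2N'_2}. \tag{43} $$

Accordingly, the matrix $(\mathbf{S}')^T\mathbf{V}'$ is a full rank matrix with probability at least $1 - \delta$.

Proof of Lemma 6

Since we use random sampling with replacement, the number of inliers in the selected columns follows a Binomial distribution. Suppose $n_i$ is the number of sampled inlier columns. Then, $n_i$ is a Binomial random variable with $m_1$ independent experiments, each with success probability $N'_2/N_2$. Therefore, using Chernoff bound for Binomial distributions [39], we have

$$ \mathbb{P}\left( \alpha \le n_i \le \alpha(2\beta - 1) \right) \ge 1 - 2\exp\left( -\frac{\alpha^2(\beta - 1)^2}{3\alpha\beta} \right). \tag{44} $$

Thus, if $\beta \ge 2 + \frac{3}{\alpha}\log\frac{2}{\delta}$, the RHS of (44) is lower-bounded by $(1 - \delta)$.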
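The last step of the proof of Lemma 6 can be checked numerically. The sketch below is illustrative only and assumes the right-hand side of (44) has the form $1 - 2\exp(-\alpha(\beta-1)^2/(3\beta))$, as the Chernoff bound gives; the function names are hypothetical.

```python
import math


def inlier_count_guarantee(alpha, beta):
    """Assumed RHS of (44): lower bound on P(alpha <= n_i <= alpha(2*beta - 1))."""
    return 1.0 - 2.0 * math.exp(-alpha * (beta - 1.0) ** 2 / (3.0 * beta))


def sufficient_beta(alpha, delta):
    """Smallest beta allowed by the condition beta >= 2 + (3/alpha) log(2/delta)."""
    return 2.0 + (3.0 / alpha) * math.log(2.0 / delta)


if __name__ == "__main__":
    for alpha in (5.0, 20.0, 100.0):
        for delta in (0.1, 0.01):
            beta = sufficient_beta(alpha, delta)
            bound = inlier_count_guarantee(alpha, beta)
            print(f"alpha={alpha:6.1f} delta={delta:5.2f} "
                  f"beta={beta:6.3f} guarantee={bound:.6f}")
```

The check works because $(\beta - 1)^2/\beta \ge \beta - 2$, so the exponent is at least $\alpha(\beta - 2)/3 \ge \log(2/\delta)$.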
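Proof of Lemma 7

Since we use random sampling with replacement, the number of outliers $n_o$ in the matrix $\mathbf{D}_s$ follows a Binomial distribution with $m_1$ independent experiments, each having success probability $K/N_2$. Using Chernoff bound we have that

$$ \mathbb{P}\left( \left| n_o - \alpha\frac{K}{N'_2} \right| \le \frac{1}{c}\,\alpha\frac{K}{N'_2} \right) \ge 1 - 2\exp\left( -\frac{\alpha K}{3c^2 N'_2} \right), \tag{45} $$

so that

$$ \mathbb{P}\left( n_o \le \alpha\frac{K}{N'_2}\left(1 + \frac{1}{c}\right) \right) \ge 1 - 2\exp\left( -\frac{\alpha K}{3c^2 N'_2} \right). \tag{46} $$

Therefore, the RHS of (46) is greater than $1 - \delta$ if (22) is satisfied.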

Proof of Lemma 8

To prove Lemma 8 and Lemma 10, we make use of the following result from [30] and [40].


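Lemma 13. Let $\mathcal{U}$ denote a union of $n$ linear subspaces in $\mathbb{R}^{N_1}$, each of dimension at most $d$. For fixed $\delta \in (0, 1)$ and $\epsilon \in (0, 1)$, suppose $\mathbf{\Phi}$ is an $m_2 \times N_1$ matrix satisfying the distributional JL property with

$$ m_2 \ge \frac{d\log(42/\epsilon) + \log n + \log\frac{2}{\delta}}{f(\epsilon/\sqrt{2})}. \tag{47} $$

Then,

$$ (1 - \epsilon)\|\mathbf{v}\| \le \|\mathbf{\Phi}\mathbf{v}\| \le (1 + \epsilon)\|\mathbf{v}\| \tag{48} $$

holds simultaneously for all $\mathbf{v} \in \mathcal{U}$ with probability at least $(1 - \delta)$.

According to Lemma 13, if $m_2$ satisfies (24), then (48) holds with $\epsilon = 1/\sqrt{2}$ for all the vectors in the column space of $\mathbf{D}_s$ with probability at least $1 - \delta$. If (48) holds for all $\mathbf{v} \in \mathrm{span}(\mathbf{D}_s)$, then it is straightforward to show that the rank of $\mathbf{\Phi}\mathbf{D}_s$ is equal to the rank of $\mathbf{D}_s$.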

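Proof of Lemma 10

Suppose $\mathbf{D}_s$ contains $k$ outlying data points. Assume $\{\mathcal{T}_i\}_{i=1}^{k}$ represents a union of $k$ linear subspaces in $\mathbb{R}^{N_1}$, where each subspace $\mathcal{T}_i$ is spanned by $\{\mathbf{U}, \mathbf{c}_i\}$ and $\mathbf{c}_i$ is the $i$-th non-zero column of $\mathbf{C}_s$. According to Data model 1, the subspace $\mathcal{T}_i$ is an $(r + 1)$-dimensional subspace since $\mathbf{c}_i$ does not lie in the column space of $\mathbf{L}$. Suppose $\mathbf{\Phi}$ is a stable embedding of the union of subspaces $\{\mathcal{T}_i\}_{i=1}^{k}$. Then, the dimension of the subspaces $\{\mathcal{T}_i\}_{i=1}^{k}$ is not changed during the embedding operation. Accordingly, the columns of $\mathbf{\Phi}\mathbf{C}_s$ do not lie in the column space of $\mathbf{\Phi}\mathbf{L}$. Note that $q \ge k$. Thus, according to Lemma 13, if

$$ m_2 \ge \frac{(r + 1)\log(42\sqrt{2}) + \log q + \log\frac{2}{\delta}}{f(1/2)}, \tag{49} $$

then the rank of $\mathbf{\Phi}\mathbf{L}$ is equal to the rank of $\mathbf{L}$ and the non-zero columns of $\mathbf{\Phi}\mathbf{C}_s$ do not lie in the column space of $\mathbf{\Phi}\mathbf{L}$, with probability at least $1 - \delta$.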

Proof of Lemma 9

Since the rank of $\mathbf{L}_s$ is equal to $r$, $\mathbf{L}$ and $\mathbf{L}_s$ have the same column space. Suppose $\mathbf{C}_s$ contains $k$ outlying columns. We break this proof into two steps. In the first step, it is shown that the rank of $\mathbf{L}_s$ is equal to the rank of $\mathbf{\Phi}\mathbf{L}_s$ whp. Define $\mathbf{U}_s^{\phi\perp}$ as an orthonormal basis for the complement of the column space of $\mathbf{L}_s^{\phi} = \mathbf{\Phi}\mathbf{L}_s$. If the rank of $\mathbf{L}_s^{\phi}$ is equal to $r$, then $\mathbf{U}_s^{\phi\perp} \in \mathbb{R}^{m_2 \times (m_2 - r)}$. In the second step, it is proven that the rank of

$$ (\mathbf{U}_s^{\phi\perp})^T \mathbf{C}_s^{\phi} \tag{50} $$

is equal to $k$ whp. The matrix (50) is the projection of the columns of $\mathbf{C}_s^{\phi}$ onto the complement of the column space of $\mathbf{L}_s^{\phi}$. Lemma 9 follows if these two requirements are satisfied. For the first part, we make use of the following lemma.

Lemma 14. Suppose $m_2$ rows are sampled uniformly at random (without replacement) from the matrix $\mathbf{L}$ with rank $r$. If

$$ m_2 \ge r\eta_u \max\left( c_1 \log r,\; c_2 \log\frac{3}{\delta} \right), \tag{51} $$

then the selected rows of the matrix $\mathbf{L}$ span the row space of $\mathbf{L}$ with probability at least $(1 - \delta)$, where $c_1$ and $c_2$ are numerical constants.

The matrices $\mathbf{L}_s$ and $\mathbf{L}$ have the same column space. Thus, if $m_2$ satisfies (51), the rank of $\mathbf{\Phi}\mathbf{L}_s$ is equal to the rank of $\mathbf{L}_s$ with probability at least $1 - \delta$. Now we prove the second part. Assume the first part is satisfied, i.e., the rank of $\mathbf{L}_s^{\phi}$ is equal to $r$. It is easy to show that since $\mathbf{U}_s^{\phi\perp}$ is an orthonormal matrix, then the elements of matrix (50) are zero-mean independent normal random variables with equal variance. In order to show that the rank of (50) is equal to $k$, we make use of the following lemma from [41], [42].

Lemma 15. Let $\mathbf{A}$ be an $N \times n$ matrix whose entries are independent standard normal variables. Then for every $t \ge \sqrt{2\log(2/\delta)}$,

$$ \sqrt{N} - \sqrt{n} - t \le \sigma_{\min}(\mathbf{A}) \le \sigma_{\max}(\mathbf{A}) \le \sqrt{N} + \sqrt{n} + t \tag{52} $$

with probability at least $1 - \delta$, where $\sigma_{\min}(\mathbf{A})$ and $\sigma_{\max}(\mathbf{A})$ are the minimum and maximum singular values of $\mathbf{A}$.

Define $\mathbf{Z}$ as the non-zero columns of the matrix in (50). Based on Lemma 15, to prove that the rank of $\mathbf{Z}$ is equal to $k$ with probability at least $1 - \delta$, it suffices to have

$$ \sqrt{m_2 - r} - \sqrt{q} \ge \sqrt{2\log(2/\delta)}. $$
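Rearranging the last inequality gives a closed-form requirement on $m_2$, namely $m_2 \ge r + q + 2\log(2/\delta) + \sqrt{8q\log(2/\delta)}$. The sketch below, with hypothetical function names, evaluates that threshold and confirms that it is equivalent to the singular-value condition above; it illustrates the algebra only and does not model the first term of the lemma's bound.

```python
import math


def rows_needed(r, q, delta):
    """Threshold obtained by squaring sqrt(m2 - r) >= sqrt(q) + sqrt(2 log(2/delta))."""
    log_term = math.log(2.0 / delta)
    return r + q + 2.0 * log_term + math.sqrt(8.0 * q * log_term)


def rank_condition_holds(m2, r, q, delta):
    """Check sqrt(m2 - r) - sqrt(q) >= sqrt(2 log(2/delta)), up to rounding error."""
    if m2 < r:
        return False
    lhs = math.sqrt(m2 - r) - math.sqrt(q)
    rhs = math.sqrt(2.0 * math.log(2.0 / delta))
    return lhs >= rhs - 1e-12


if __name__ == "__main__":
    r, q, delta = 20, 30, 0.01
    m2 = math.ceil(rows_needed(r, q, delta))
    print("smallest integer m2 meeting the threshold:", m2)
    print("condition holds at m2:     ", rank_condition_holds(m2, r, q, delta))
    print("condition holds at m2 - 10:", rank_condition_holds(m2 - 10, r, q, delta))
```

The threshold grows linearly in $r + q$ and only logarithmically in $1/\delta$, which is why the number of sampled outliers $q$ matters for the row compression step.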


Proof of Lemma 11

Similar to the proof of Lemma 9, we can guarantee that if $m_2$ satisfies inequality (27), then the rank of $\mathbf{\Phi}\mathbf{L}$ is equal to the rank of $\mathbf{L}$ with probability at least $1 - \delta$.

Suppose $\mathbf{c}$ is a non-zero column of $\mathbf{C}_s$. Similar to the analysis provided in the proof of Lemma 9, if

$$ \sqrt{m_2 - r} - 1 \ge \sqrt{2\log(2/\delta)}, $$

then $\mathbf{\Phi}\mathbf{c}$ does not lie in the column space of $\mathbf{\Phi}\mathbf{L}$ with probability at least $1 - \delta$. Thus, if

$$ m_2 \ge r + 1 + 2\log\frac{2q}{\delta} + 2\sqrt{2\log\frac{2q}{\delta}}, $$

then the non-zero columns of $\mathbf{\Phi}\mathbf{C}_s$ do not lie in the column space of $\mathbf{\Phi}\mathbf{L}$ with probability at least $1 - \delta$.
Proof of Theorem 1

Algorithm 1 with RED recovers the exact subspace if:

[I] The inliers of $\mathbf{D}_s$ span the column space of $\mathbf{L}$, and each inlier of $\mathbf{D}_s$ lies in the span of the other inlier columns of $\mathbf{D}_s$.

[II] If $\mathbf{D}_s$ contains $k$ outlying columns, the rank of $\mathbf{\Phi}\mathbf{D}_s$ is equal to $r + k$.

Lemma 5 and Lemma 6 establish a sufficient condition for $m_1$ to guarantee [I] whp. Given Assumption 1, the rank of $\mathbf{D}_s$ is equal to $r + k$. Lemma 8 provides a sufficient condition for $m_2$ to ensure that the rank of $\mathbf{\Phi}\mathbf{D}_s$ is equal to the rank of $\mathbf{D}_s$ whp, i.e., [II] is guaranteed whp. In addition, Lemma 7 provides an upper-bound on the number of sampled outliers. Therefore, according to Lemma 5, Lemma 6, Lemma 7 and Lemma 8, if (12) is satisfied, Algorithm 1 with RED recovers the correct subspace with probability at least $1 - 4\delta$.

In addition, similar to the analysis provided in the proof of Lemma 10, if

$$ m_2 \ge \frac{(r + 1)\log(42\sqrt{2}) + \log K + \log\frac{2}{\delta}}{f(1/2)}, \tag{53} $$

then the non-zero columns of $\mathbf{\Phi}\mathbf{C}$ do not lie in the column space of $\mathbf{\Phi}\mathbf{L}$ with probability at least $1 - \delta$. Thus, if the subspace is learned correctly, (11) identifies the outlying columns correctly with probability at least $1 - \delta$.


Proof of Theorem 2

The proof of Theorem 2 is similar to the proof of Theorem 1 but we need to make use of Lemma 9 (instead of Lemma 8) to guarantee [II] whp. Therefore, according to Lemma 5, Lemma 6, Lemma 9 and Lemma 7, if the requirements of Theorem 2 are satisfied, Algorithm 1 with RRD recovers the correct subspace with probability at least $1 - 5\delta$. In addition, similar to the analysis provided in the proof of Lemma 11, if

$$ m_2 \ge r + 1 + 2\log\frac{2K}{\delta} + \sqrt{8\log\frac{2K}{\delta}}, \tag{54} $$

then the non-zero columns of $\mathbf{\Phi}\mathbf{C}$ do not lie in the column space of $\mathbf{\Phi}\mathbf{L}$ with probability at least $1 - \delta$. Thus, if the subspace is learned correctly, (11) identifies the outlying columns correctly with probability at least $1 - \delta$.

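Proof of Theorem 3

In order to guarantee that Algorithm 2 recovers the exact subspace, we have to ensure that

(a) The columns of $\mathbf{L}_s$ span the column space of $\mathbf{L}$.

(b) Requirement 2 is satisfied.

(c) The optimization problem solved in Algorithm 2 yields correct decomposition, i.e., the column space of $\hat{\mathbf{L}}_s^{\phi}$ is equal to the column space of $\mathbf{\Phi}\mathbf{L}$ and the non-zero columns of $\hat{\mathbf{C}}_s^{\phi}$ and $\mathbf{\Phi}\mathbf{C}_s$ are at the same locations.

Guarantee for (a):

It suffices to show that the rank of $\mathbf{V}^T\mathbf{S}$ is equal to $r$. According to the proof of Lemma 5, if we set

$$ m_1 \ge 10\, r\mu_v \log\frac{2r}{\delta}, \tag{55} $$

then,

$$ \mathbb{P}\left[ \left\| \mathbf{V}^T\mathbf{S}\mathbf{S}^T\mathbf{V} - \frac{m_1}{N_2}\mathbf{I} \right\| > \frac{m_1}{2N_2} \right] \le \delta. \tag{56} $$

If $\sigma_1$ and $\sigma_r$ denote the largest and smallest singular values of $\mathbf{S}^T\mathbf{V}$, respectively, then

$$ \frac{m_1}{2N_2} \le \sigma_r^2 \le \sigma_1^2 \le \frac{3m_1}{2N_2} \tag{57} $$

with probability at least $1 - \delta$. Accordingly, the matrix $\mathbf{V}^T\mathbf{S}$ is a full rank matrix with probability at least $1 - \delta$. In addition, we study the row space coherency of matrix $\mathbf{L}_s$ since it is used to derive the guarantee for (c). The projection of the standard basis onto the row space of $\mathbf{L}_s$ can be written as

$$ \max_i \|\mathbf{P}_{\mathbf{S}^T\mathbf{V}}\mathbf{e}_i\|_2^2 = \max_i \|\mathbf{S}^T\mathbf{V}(\mathbf{V}^T\mathbf{S}\mathbf{S}^T\mathbf{V})^{-1}\mathbf{V}^T\mathbf{S}\mathbf{e}_i\|_2^2 $$

$$ \le \max_j \|\mathbf{S}^T\mathbf{V}(\mathbf{V}^T\mathbf{S}\mathbf{S}^T\mathbf{V})^{-1}\mathbf{V}^T\mathbf{e}_j\|_2^2 $$

$$ \le \max_j \|\mathbf{S}^T\mathbf{V}(\mathbf{V}^T\mathbf{S}\mathbf{S}^T\mathbf{V})^{-1}\|^2 \, \|\mathbf{V}^T\mathbf{e}_j\|_2^2 $$

$$ \le \frac{\mu_v r}{N_2}\,\frac{\sigma_1^2}{\sigma_r^4} \le \frac{\mu_v r}{N_2}\,\frac{6N_2}{m_1} = \frac{6r\mu_v}{m_1}, \tag{58} $$

where $\mathbf{S}^T\mathbf{V}(\mathbf{V}^T\mathbf{S}\mathbf{S}^T\mathbf{V})^{-1}\mathbf{V}^T\mathbf{S}$ is the projection matrix onto the column space of $\mathbf{S}^T\mathbf{V}$. The first inequality follows from the fact that $\{\mathbf{S}\mathbf{e}_i\}_{i=1}^{m_1}$ is a subset of $\{\mathbf{e}_j\}_{j=1}^{N_2}$. The second inequality follows from Cauchy-Schwarz inequality and the third inequality follows from (57).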
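The arithmetic behind the last step of (58) can be verified with a short sketch. It assumes that (57) bounds the squared singular values between $m_1/(2N_2)$ and $3m_1/(2N_2)$ and that the coherence assumption reads $\max_j \|\mathbf{V}^T\mathbf{e}_j\|_2^2 \le r\mu_v/N_2$; the function name and the example values are hypothetical.

```python
def sketch_coherence_bound(r, mu_v, m1, n2):
    """Worst-case value of (r*mu_v/n2) * sigma_1^2 / sigma_r^4 under the bounds in (57)."""
    sigma_r_sq_min = m1 / (2.0 * n2)        # lower end of (57)
    sigma_1_sq_max = 3.0 * m1 / (2.0 * n2)  # upper end of (57)
    worst_ratio = sigma_1_sq_max / sigma_r_sq_min ** 2  # equals 6 * n2 / m1
    return (r * mu_v / n2) * worst_ratio


if __name__ == "__main__":
    r, mu_v, m1, n2 = 5, 1.5, 400, 4000
    print("bound from (57):", sketch_coherence_bound(r, mu_v, m1, n2))
    print("6 r mu_v / m1:  ", 6.0 * r * mu_v / m1)
```

Both printed values coincide, and the bound does not depend on $N_2$, which is what allows the coherence of the sketch to be controlled through $m_1$ alone.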


Guarantee for (b):

Suppose that (a) is true. If $q$ is the number of outliers of $\mathbf{D}_s$, Lemma 10 provides a sufficient condition for $m_2$ (inequality (26)) to guarantee that Requirement 2 is satisfied.

Guarantee for (c):

Suppose (a) and (b) are satisfied. First, let us review the theoretical result which supports the performance of the convex algorithm.


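Lemma 16. Suppose $\mathbf{D}$ follows Data model 1 and define $\mathbf{L}^*$ and $\mathbf{C}^*$ as the optimal point of the convex program. If

$$ K \le \frac{1}{1 + (121/9)\, r\mu_v}\, N_2 \quad \text{and} \quad \lambda = \frac{3}{7\sqrt{K}}, \tag{59} $$

then the column space of $\mathbf{L}^*$ is equal to the column space of $\mathbf{L}$ and the location of non-zero columns of $\mathbf{C}^*$ indicate the location of non-zero columns of $\mathbf{C}$.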


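The matrix $\mathbf{D}_s^{\phi}$ can be expressed as $\mathbf{D}_s^{\phi} = \mathbf{L}_s^{\phi} + \mathbf{C}_s^{\phi}$, where $\mathbf{L}_s^{\phi} = \mathbf{\Phi}\mathbf{L}\mathbf{S}$ and $\mathbf{C}_s^{\phi} = \mathbf{\Phi}\mathbf{C}\mathbf{S}$. If the rank of $\mathbf{\Phi}\mathbf{L}$ is equal to $r$, then $\mathbf{L}_s$ and $\mathbf{L}_s^{\phi}$ have the same row space. Thus, if $\mathbf{V}_s^{\phi}$ is an orthonormal basis for the row space of $\mathbf{L}_s^{\phi}$, then from (58)

$$ \max_i \|(\mathbf{V}_s^{\phi})^T\mathbf{e}_i\|_2^2 \le \frac{6r\mu_v}{m_1}. \tag{60} $$

Define $n_{\mathbf{L}_s^{\phi}}$ as the number of non-zero columns of $\mathbf{L}_s^{\phi}$. Therefore,

$$ \max_i \|(\mathbf{V}_s^{\phi})^T\mathbf{e}_i\|_2^2 \le \frac{6r\mu_v}{m_1} \le \frac{6r\mu_v}{n_{\mathbf{L}_s^{\phi}}}. \tag{61} $$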

Suppose $m_1 = \zeta \frac{N_2}{N'_2}$. According to Lemma 7, if $\zeta \ge 3g^2 \frac{N'_2}{K}\log\frac{2}{\delta}$, then the number of outlying columns of $\mathbf{D}_s^{\phi}$ is less than or equal to

$$ \zeta\,\frac{K}{N'_2}\left(1 + \frac{1}{g}\right) \tag{62} $$

with probability at least $(1 - \delta)$ where $g$ can be any number greater than one. Therefore, if $m_1 \ge$


N2 

n!: 


(3(?"^log|) 


g > ^ (1 + 6r/ri,(121/9)) , 


and 


(63) 


then according to Lemma 16 the column space of hf is equal 


(64) 


to the column space of $L and the non-zero columns of Cf 
and are at the same locations provided that 

^ ^ p^-(l + 6rM„(121/9)) 

N 2 ~ g(l + 6r^„(121/9)) 

Therefore, if the requirements of Theorem 3 are satisfied, Algorithm 2 with RED extracts the exact subspace with probability at least $1 - 3\delta$. In addition, according to the analysis provided in the proof of Lemma 10, if $m_2$ satisfies the requirement of Theorem 3, then the columns of $\mathbf{\Phi}\mathbf{C}_s$ do not lie in the column space of $\mathbf{\Phi}\mathbf{L}$, and the non-zero columns of $\mathbf{\Phi}\mathbf{C}$ do not lie in the column space of $\mathbf{\Phi}\mathbf{L}$ whp, i.e., if the exact subspace is retrieved, (11) identifies the outlying columns correctly whp.

Proof of Theorem 4

The proof of Theorem 4 is similar to the proof of Theorem 3. But, we use Lemma 11 to establish a sufficient condition on $m_2$ to guarantee (b). In addition, according to the analysis in the proof of Lemma 11, if $m_2$ satisfies the requirement of Theorem 4, not only is Requirement 2 satisfied whp, but also the non-zero columns of $\mathbf{\Phi}\mathbf{C}$ do not lie in the column space of $\mathbf{\Phi}\mathbf{L}$ whp, i.e., (11) identifies the outlying columns correctly whp in case of exact subspace recovery.